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Abstract 


In  this  report  we  shall  describe  a  method  for  fitting  variable  knot  spline  models  to 
noisy  univariate  data  which  uses  a  genetic  algorithm  to  optimize  over  the  number  and 
location  of  the  knots.  For  a  fixed  number  of  knots,  the  location  of  the  knots  is  chosen 
to  minimize  the  sum  of  squares  error;  the  appropriate  number  of  knots  is  determined 
by  the  adjusted  GCV  criterion  of  Luo  and  Wahba  (1997).  The  objective  is  to  find  the 
model  which  minimizes  RSS/clf^  where  the  degrees  of  freedom  are  inflated  to  reflect 
the  ada.ptive  nature  of  the  knot  search  (i.e.,  selection  of  bcisis  functions).  We  justify 
theoretically  that  our  algorithm  will  converge  to  the  variable  knot  model  which  op¬ 
timizes  the  model  fitting  criterion,  given  that  this  model  is  contained  in  the  search 
space.  A  modified  bootstrap  technique  is  used  to  obtain  pointwise  standard  errors  for 
models  obtained  by  the  GA  method.  Experimental  results  comparing  the  performance 
of  the  proposed  algorithm  to  those  obtained  using  the  non-linear  optimization  tech¬ 
nique  of  Schwetlick  and  Schiitze  (1995),  the  genetic  algorithm  proposed  by  Manela  et. 
al.  (1993),^  and  the  method  of  Luo  and  Wahba  (1997)  are  presented.  We  also  discuss 
the  extension  our  technique  to  related  problems. 


Key  Words:  Genetic  algorithms,  variable  knot  splines,  least  squares,  model  selection, 
statistical  data  analysis. 


1  Introduction 


Suppose  we  are  asked  to  analyze  a  set  of  N  measurements  {(x,-,i/,)  :  i  -  in 

a  <  Xi  <  X2  <  ■  ■  ■  <  xn  <  b,  where  the  system  which  generated  the  data  can  be 
described  by  an  equation  of  the  form  t/j  =  /(x,)  +  e,’  for  some  unknown  function  /.  The 
e^s  are  assumed  to  be  independent  and  follow  some  distribution  with  mean  zero.  One 
approach  to  this  problem  is  to  attempt  to  replace  the  noisy  and/or  complex  relationship 
which  the  data  represent  by  something  simple  but  reasonable  which  captures  the  nature 
of  the  dependence  of  y  on  x.  To  achieve  this  goal  we  assume  that  /  can  be  approximated 
by  a  continuous  piecewise  polynomial  which  satisfies  certain  continuity  conditions  on 
its  lower  order  derivatives.  A  natural  measure  of  the  quality  of  such  an  approximation 
is  a  statistic  based  on  the  sum  of  the  squares  error  adjusted  to  avoid  ‘overfitting’  of 
the  data.  In  this  context,  a  rich,  flexible  class  of  models  is  the  space  of  spline  functions 
of  order  m  with  some  knot  sequence  t. 

The  choice  of  knot  sequence  t,  in  terms  of  both  the  number  of  knots  and  their  place¬ 
ment,  can  be  crucial  for  both  the  shape  and  the  quality  of  a  spline  fit  [1].  By  performing 
afl  optimization  over  the  number  and  location  of  the  knots,  as  well  as  the  spline  coeffi¬ 
cients,  the  model  fit  can  in  general  be  significantly  improved.  However,  determining  the 
best  model  over  this  space  of  variable  knot  splines  is  a  very  poorly  behaved  non-linear 
optimization  problem  [2]. 

In  this  paper  we  propose  a  method  which  uses  genetic  algorithms  to  determine  the 
number  and  location  of  the  knots.  For  a  fixed  number  of  knots  the  location  of  the 
knots  is  chosen  to  minimize  the  sum  of  squares  errorj  the  appropriate  number  of 
knots  is  determined  by  selecting  the  model  which  minimizes  this  error  adjusted  for  the 
number  of  knots  or  degrees  of  freedom’  used  in  the  fit.  Although  we  consider  only 
least  squares  splines,  our  method  can  be  generalized  to  spline  fitting  by  penalized  least 
squares  or  more  robust  criteria. 

Recent  contributions  to  the  problem  of  fitting  variable  knot  splines  have  been  made  by 
several  authors.  Larson  [3]  considers  linear  least  squares  splines  with  unknown  knot 
locations  while  Hu  [4]  presents  an  algorithm  similar  in  spirit  to  that  of  de  Boor  [5] 
which  seeks  an  optimal  knot  distribution.  Schwetlick  and  Shviltze  [6]  have  devised  a 
nonlinear  optimization  routine  which  searches  for  the  optimal  least  squares  (LS)  knot 
placement.  The  number  of  knots  is  chosen  as  the  fewest  number  necessary  to  model  the 
data  within  some  estimate  of  the  noise  level.  From  the  literature  on  genetic  algorithms, 
Manela  et.  al.  [7]  have  applied  genetic  algorithms  to  penalized  least  squares  spline 
fitting  with  optimization  over  the  number  of  knots,  the  order  of  the  spline,  and  the 
smoothing  parameter.  They  use  the  method  of  Dierckx  [8]  to  get  a  reasonable  knot 
distribution  for  a  fixed  number  of  knots;  his  recommendations  on  possible  values  for 
the  order  and  number  of  knots  are  used  to  constrain  the  search  space. 

A  general  discussion  of  methods  for  approximation  by  variable  knot  splines  and  the 
motivation  for  the  proposed  method  are  provided  in  Section  2.  Section  3  contains 
an  overview  of  genetic  algorithms  as  a  tool  for  finding  a  near  optimal  solution  to  the 
preset  problem  .  A  specific  formulation  of  this  problem  followed  by  a  presentation  of 
the  GA  method  and  a  justification  of  its  convergence  are  given  in  Section  4.  Imple¬ 
mentation  details,  proposed  experiments,  and  expected  results  are  discussed  in  Section 


1 


5.  We  conclude  with  general  comments  and  directions  for  future  research  in  Section  6. 

Only  a  brief  introduction  to  spline  functions  will  be  presented  here;  an  excellent  in¬ 
troduction  to  spline  theory  and  their  applications  to  data  analysis  can  be  found  in 
de  Boor  [5],  Wahba  [9],  Wegman  and  Wright  [10],  and  Eubank  [11].  A  nice  overview 
of  algorithms  for  ‘free’  knot  splines  similar  to  the  one  given  below  can  be  found  in 
Schwetlick  and  Schultze  [6]. 


2  Existing  Methodology 


For  a  fixed  knot  sequence,  numerically  stable  algorithms  for  the  computation  of  the  LS 
spline  model  can  be  found  in  the  literature;  see,  e.g.,  de  Boor  [5].  However,  as  noted 
earlier,  ‘freeing’  the  knots  in  the  sense  of  optimizing  their  placement  and  number  yields 
a  superior  spline  model  in  most  cases.  In  essence,  the  number  of  knots  can  be  viewed 
as  equivalent  to  the  window  size  in  a  smoothing  model  -  more  knots  leads  to  more 
smoothing.  ‘Freeing’  the  knots  is  comparable  to  using  a  smoother  whose  window  size 
is  locally  adaptive  -  allowing  a  smaller  window  size  (the  placement  of  more  knots)  in 
areas  where  more  flexibility  is  needed,  as  in  places  of  peaks  or  high  curvature,  and  a 
larger  window  size  (the  placement  of  less  knots)  in  areas  of  relatively  low  variability. 
In  this  way  the  number  of  knots  and  their  placement  can  make  a  difference  in  terms 
of  quality  of  fit  [12].  Hence  the  selection  of  a  least  squares  spline  model  should  involve 
some  adaptive  procedure  for  determining  the  number  and  location  of  the  knots,  i.e,, 
the  knot  sequence  t. 

Some  of  the  earliest  attempts  at  fitting  variable  knot  splines  include  those  of  Bellman 
and  Roth  [13]  for  linear  splines  and  Hawkins’  [14]  method  based  on  dynamic  program¬ 
ming.  These  have  been  followed  by  algorithms  which  iteratively  add  or  delete  knots 
but  do  not  attempt  optimal  knot  placement,  only  a  distribution  for  the  knots  which 
is  in  some  sense  optimal.  Included  here  is  de  Boor’s  algorithm  NEWKNOT  [5],  the 
method  of  Agarwal  and  Studden  [15]  for  linear  splines,  and  the  recent  contribution 
of  Hu  [4].  Manela  et.  al.  [7]  avoid  the  iterative  addition  and  deletion  of  knots  by 
using  genetic  algorithms  to  fit  penalized  LS  splines.  They  optimize  the  choice  of  the 
smoothing  parameter  and  allow  the  order  of  the  spline  and  the  number  of  knots  to 
vary  (within  certain  constraints);  for  a  fixed  number  of  knots  they  utilize  Dierckx’s 
algorithm  [8]  to  determine  the  knot  distribution. 

There  are  also  nonlinear  optimization  routines  which  attempt  to  optimize  knot  place¬ 
ment  with  respect  to  LS  error  for  a  fixed,  given  number  of  knots,  de  Boor  and  Rice 
[16]^  and  Jupp  [17]  consider  LS  cubic  spline  approximation  and,  beginning  with  an 
initial  knot  sequence,  search  for  the  best  knot  placement  by  Gauss- New  ton  methods. 
The  algorithm  of  Dierckx  [18]  is  similar  but  employs  the  Fletcher/Reeves  eg  method. 
Schwetlick  and  Schiitze  [6]  have  presented  a  nice  extension  of  these  works  based  on 
generalized  Gauss-Newton  methods  which  allows  for  the  optimal  placement  of  a  subset 
of  the  knot  sequence.  They  include  a  knot  removal  strategy  (see  Lyche  and  M0rken 
[19])  to  search  for  the  model  which  approximates  the  data  to  within  its  estimated  noise 
level  using  the  fewest  number  of  knots. 


2 


L 


Variable  knot  splines  have  been  suggested  in  the  statistical  literature  for  several  data 
analytic  applications,  the  most  predominant  of  which  has  been  additive  modeling.  We 
shall  only  mention  a  few  selected  additive  modeling  methods  with  respect  to  regression 
spline  basis  selection;  see  Hastie  and  Tibshirani  [12]  for  a  more  thorough  review  of 
additive  models.  With  regression  splines  and  additive  modeling,  the  knot  locations 
are  restricted  to  the  set  of  design  points  and  the  problem  of  finding  the  optimal  set 
of  knots  is  approached  from  the  perspective  of  model  selection.  The  proper  choice  of 
knot  sequence  is  usually  determined  by  a  stepwise  addition  and/or  deletion  procedure 
where  only  models  whose  size  is  within  a  certain  range  are  considered.  The  appropriate 
model  is  taken  as  the  one  which  minimizes  a  criterion  based  on,  for  example,  generalized 
cross-validation  (GCV)  [20]  or  Mallows’  Cp  statistic  [21].  Alternatively,  a  generalized 
backfitting  approach  similar  to  that  of  Breiman  [22]  could  be  devised  where  knot 
selection  is  incorporated  into  the  iterative  process  of  determining  the  current  choice  of 
each  smoothing  matrix.  This  possibility  will  not  be  pursued  here. 

Methods  which  begin  with  many  knots  and  then  delete  those  which  are  least  important 
with  respect  to  approximation  and/or  prediction  error  include  those  of  Smith  [23] 
and  Lyche  and  M0rcken  [19].  Friedman  and  Silverman’s  [24]  algorithm  TURBO  does 
stepwise  addition  of  knots  to  get  a  series  of  models;  of  these,  the  model  which  minimizes 
a  GCV  criterion  is  chosen.  Knots  are  then  deleted  from  this  model  to  achieve  the  final 
fit.  In  MARS,  Friedman  [25]  utilizes  a  stepwise  knot  addition/deletion  strategy  for 
linear  splines.  His  GCV  criterion  includes  a  ‘cost’  term  to  adjust  for  the  adaptive  nature 
of  the  knot  selection,  as  suggested  by  Hastie  [26].  Breiman’s  [27]  DK/CV  algorithm  is 
based  on  Smith’s  work;  it  uses  knot  deletion  combined  with  a  CV  criterion  for  model 
selection.  His  work  has  shown  DK/CV  to  be  superior  on  small,  noisy  datasets  to  ACE 
[28]  which  uses  a  backfitting  approach.  A  recent  contribution  to  additive  modeling  has 
been  made  by  Luo  and  Wahba  [29]  with  their  algorithm  HAS.  They  peidorm  forward 
knot  selection  via  GCV  with  a  ‘cost’  term  but  replace  backward  deletion  by  ridge 
regression  to  reduce  model  variability  and  improve  numerical  stability. 

Unlike  the  algorithms  mentioned,  the  proposed  method  combines  the  following: 

•  The  knots  are  not  restricted  to  the  class  of  design  points  and  for  a  given  number 
of  knots  we  determine  the  knot  sequence  which  minimizes  the  sum  of  squares 
error.  Hence  we  simultaneously  optimize  the  choice  of  model  coefficients  as  well 
as  the  number  of  knots  and  their  placement.  For  implementation  reasons  we 
restrict  ourselves  to  splines  with  simple  knots. 

•  As  opposed  to  utilizing  a  stepwise  procedure,  which  is  known  to  be  suboptimal, 
we  allow  a  genetic  algorithm  to  search  over  the  space  of  models  with  k  interior 
knots,  k  <  fcmax,  and  use  a  model  selection  criterion  based  on  GCV  to  choose  the 
model  which  minimizes  error  yet  contains  an  appropriate  number  of  knots. 

•  It  has  been  noted  by  Jupp  [17]  that  with  ‘free’  knot  LS  splines,  the  error  function 
is  nonconvex  in  the  knot  sequence  t.  Hence  nonlinear  optimization  algorithms 
which  use  generalized  Gauss-Newton  and/or  conjugate  gradient  methods  can 
converge  to  local  mifiima  or  saddlepoints.  In  contrast,  it  has  been  shown  theo¬ 
retically  (see  Section  4.5)  that  if  the  global  optimum  is  contained  in  the  search 
space  then  the  proposed  genetic  algorithm  will  converge  to  that  global  optimum 
regardless  of  the  shape  of  the  function  being  optimized.  It  should  also  be  noted 
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that  the  result  of  non-linear  optimization  often  depends  on  the  choice  of  initial 
knot  sequence  whereas  genetic  algorithms  are  relatively  independent  of  initial 
parameter  estimates. 

Before  discussing  the  details  of  the  proposed  method  we  will  give  a  general  overview 
of  genetic  algorithms  as  a  tool  for  optimization. 


3  Genetic  Algorithms 


Genetic  algorithms  are  stochastic  search  methods  which  provide  a  near  optimal  solution 
to  the  evaluation  function  of  an  optimization  problem  [30].  Originally  developed  by 
Holland  [31],  they  can  be  used  to  search  complex,  multimodal  surfaces  via  steps  which 
have  been  designed  to  mimic  the  processes  of  natural  genetic  systems.  They  work 
simultaneously  on  a  number  of  possible  solutions  to  help  prevent  the  algorithm  from 
getting  stuck  in  a  local  optimum.  With  little  adjustment  -  in  most  cases  by  only 
redefining  the  fitness  function  -  GAs  can  be  applied  to  a  wide  range  of  modeling 
problems.  Thus  it  is  possible  to  use  the  same  basic  algorithm  to,  for  example,  fit 
piecewise  polynomials  satisfying  different  optimization  criteria  to  the  same  dataset. 
Situations  related  to  various  research  areas  -  including  scheduling,  classification,  and 
pattern  recognition  *  in  which  the  effectiveness  of  GAs  has  been  demonstrated  can  be 
found  in  the  literature  [32].  Some  good  references  on  GAs  include  Goldberg  [33]  and 
Holland  [34]. 

To  find  a  near  optimal  solution  to  a  given  problem,  a  genetic  algorithm  needs  only  (1) 
a  solution  space  representation,  (2)  a  choice  of  operators  for  performing  the  search, 
and  (3)  a  criterion  or  fitness  function  to  determine  the  value  of  a  proposed  solution. 
The  solution  space  representation  is  used  to  encode  each  candidate  solution  as  a  string 
or  chromosome;  a  set  of  such  chromosomes  is  called  a  population.  Starting  with  a 
randomly  generated  population,  at  each  iteration  the  algorithm  applies  the  chosen 
operators  to  the  given  population  to  yield  a  new  population  of  strings.  The  standard 
operators  are  selection,  crossover,  and  mutation.  Selection  gives  members  of  the  present 
population  with  large  fitness  values  an  increased  chance  of  being  present  in  the  next 
population.  Crossover  or  mating  allows  pairs  of  strings  to  combine  their  better  features 
to  create  an  improved  string  for  the  next  population.  Mutation  gives  the  algorithm  an 
opportunity  to  branch  into  previously  unexplored  regions  of  the  domain  space  -  this 
helps  avoid  premature  convergence  to  a  local  optimum.  An  elitism  step,  where  the 
best  individual  from  the  present  population  is  included  in  the  subsequent  population, 
is  often  incorporated.  This  cycle  of  opnerations  is  repeated  until  some  termination 
criterion  is  met,  at  which  time  the  best  string  achieved  is  generally  taken  as  the  solution 
to  the  optimization  problem.  The  fitness  function  is  used  to  determine  the  value  of 
the  solution  represented  by  a  given  string  and  hence  the  best  solution  is  represented 
by  the  string  which  maximizes  the  fitness  function. 

It  has  been  proven  that  by  including  elitism  in  the  genetic  algorithm  model,  one  can 
ensure  theoretically  that  the  GA  will  converge  to  the  global  optimum  for  a  sufficiently 
large  number  of  iterations  (Bhandari,  Murthy,  and  Pal  [35]).  This  convergence  is 
independent  of  the  choices  for  the  various  algorithm  parameters  (c.g.,  crossover  and 
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mutation  probabilities,  population  size).  The  values  of  these  parameters  only  influence 
the  rate  of  convergence. 

We  refer  the  reader  to  Pittman  and  Murthy  [36]  for  a  more  detailed  overview  of  the 
basic  GA  framework  and  operators. 


3.1  Why  GAs? 

In  the  context  of  the  present  problem,  it  is  necessary  to  find  not  only  the  proper 
knot  placement  but  also  the  proper  number  of  knots.  For  a  fixed  number  of  knots, 
the  sum  of  squares  error  is  a  nonconvex  function  of  the  knot  sequence.  For  example, 
Jupp  [17]  notes  that  for  the  titanium  heat  dataset  of  de  Boor  [5],  if  one  fits  9  variable 
knot  B-splines  of  order  4  by  least  squares  then  the  function  to  be  minimized  has  four 
stationary  points  -  one  interior  optimum,  two  local  minima,  and  one  saddle  point!  Thus 
a  traditional  derivative-based  approach  may  get  stuck  in  a  local  minimum  depending 
upon  the  choice  of  the  initial  knot  sequence.  With  genetic  algorithms,  as  mentioned 
above,  one  does  have  theoretical  convergence  to  the  global  optimum.  In  order  for 
this  convergence  to  occur  one  does  not  need  very  good  initial  estimates  of  the  knot 
locations. 

Determining  the  proper  ‘free’  knot  spline  for  a  given  dataset  can  be  viewed  as  a  variable 
selection  problem:  given  a  large  set  of  candidate  knot  locations  and  a  criterion  of 
fit,  find  the  subset  of  knots  of  a  certain  size  which  yields  the  best  fit.  Exhaustive 
enumeration  is  not  an  option  due  to  the  size  of  the  domain  space  while  the  various 
stepwise  and  stage  wise  procedures  popular  in  the  literature  are  necessarily  suboptimal 
[30].  Hence  genetic  algorithms,  by  performing  a  directed  search  over  the  model  space 
without  resorting  to  stepwise  methods,  have  the  potential  to  better  address  this  sort 
of  problem. 

Given  the  above  remarks  a  note  of  caution  about  GAs  is  in  order.  Although  their 
convergence  to  the  global  optimum  has  been  proven,  the  choice  of  the  various  algorithm 
parameters  -  population  size,  string  length,  crossover  and  mutation  probabilities  -  does 
affect  the  rate  of  convergence.  The  nature  of  this  dependence  is  generally  unknown 
so  the  selection  of  parameter  values  to  achieve  the  best  performance  can  be  a  difficult 
task.  GAs  are  very  computer  intensive  and  hence  slower  than  most  existing  methods. 
Finally,  different  runs  of  the  same  GA  can  lead  to  different  results  (although  the 
results  are  usually  reasonable  solutions).  Hence  the  nature  of  the  current  problem 
may  motivate  the  development  of  a  genetic  algorithm  based  method,  but  GAs  may  not 
be  an  appropriate  optimization  tool  in  other  modeling  contexts. 


4  Formulation  of  Problem  and  Proposed  Method 


We  specify  the  details  of  the  problem  of  interest  and  propose  a  a  method  based  on  a 
modified  variable  length  genetic  algorithm  for  finding  a  near  optimal  solution. 


4.1  Problem  statement 


Consider  a  given  univariate  dataset  {(x,-,j/,)  :  i  =  1, . . . ,  AT},  e  where  the 

{s,-}  satisfy 

a<xi<X2<--' <xn  <b  a,be'R 

The  observed  functional  relationship  between  x  and  y  may  be  represented  as 

Vi  =  fi^i)  +  Ct'  i  =  1, . . . ,  N 

where  /  is  a  continuous  function,  /  €  C[a,6],  and  the  c,s  are  independent  and  identical 
lollowing  a  distribution  with  mean  zero.  We  wish  to  approximate  the  function  /  so  that 
a  criterion  based  on  the  sum  of  the  squared  errors  (i.e.,  sum  of  squares  error)  is  mini¬ 
mized.  We  shall  represent  this  criterion  as  <^(||y  —  5r(x)||55)  for  a  given  approximation 
g  to  /,  where 


4.2  Solution  space 

Our  model  space  is  the  class  of  continuous  piecewise  polynomials  on  [a,  6]  of  order  m. 
More  specifically,  let  be  a  strictly  increasing  sequence  of  points  in  [a,  6]  with 

7-1  =  a,  Tk+2  =  6  and  ^  a  positive  integer,  k  <  k^n^^  given.  Let  be  a 

sequence  of  polynomials  of  order  m  where,  if  g  is  defined  as 

g{x)  =  Pi{x)  if  Ti<x<  Ti+i,  i  =  l,...,k  +  l 

and  is  a  given  sequence  of  positive  integers,  1  <  u,  <  m  -  1,  then  the  first  Vi 

derivatives  of  g  at  Ti  are  continuous.  Our  model  space  may  be  represented  as  Vmrv 
where 

^m,T,v  =  {g  :  g{x)  =  P,(x)  if  r,  <  x  <  r.+i,  i  =  1, . . . ,  A:  -h  1;  the  first  Vi 
derivatives  of  g  at  r,  are  continuous;  and 

are  given  sequences,  k  <  fcmax} 

For  computational  reasons  and  to  avoid  ill-conditioning,  each  element  g  G  'Pm,T,v  will 
be  represented  as  a  linear  combination  of  normalized  B-spline  basis  functions  of  order 
m.  Hence  for  fixed  k  and  given  {r.}{'+2  let  t  =  (ti,  . . . ,  tn^m)  be  such  that 

tl  =■  •  •  ■  =  tm  =  Ti  -  a  <  <  -  <b  =  Tk+2  =  U+l  =  •  •  •  =  Ln+m, 

n  -f-  XI, ■•=2  (77*  t7,),  and  each  t,-  occurs  Vi  times  in  t.  Then  'P,n,T  v  may  be  expressed 

as  ’  ' 

n 

=  {56  Vm,T,v  •  5^  =  ^  ctj  G  7Z^  j  =  1, . . . ,  n} 

i=i 

where  represents  the  sequence  of  normalized  B-splines  of  order  m  with 

respect  to  the  knot  sequence  t.  The  explicit  representation  of  in  terms  of  m-th 
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divided  differences  of  the  truncated  power  functions  will  not  be  given  here;  see  de  Boor 
[5]. 

It  is  known  that  the  problem  of  best  approximation  always  has  a  solution  in  the  space 
of  splines  with  multiple  knots  but  such  a  solution  does  not  always  exist  in  the  space 
of  splines  with  simple  knots  [6].  However,  for  implementation  reasons  we  will  restrict 
ourselves  to  only  simple  knot  splines,  i.e.,  splines  for  which  t>,-  =  m  - 1,  z  =  2, . . . ,  A;  + 1, 
and  n  =  m  +  k. 

If  we  let  B(t)  denote  the  matrix  of  dimension  x  of  B-splines  of  order  tn  with 
respect  to  t,  so  that 

~  i=i . 

then  the  problem  of  interest  can  be  restated  as 

min(/>(||y-»aB(t)||) 


4.3  The  genetic  algorithm  model 


The  solution  space  under  consideration  contains  models  of  various  sizes,  i.e.,  the  num¬ 
ber  of  knots  is  allowed  to  vary.  Hence  when  we  represent  each  model  as  a  string  or 
chromosome  where  each  parameter  is  represented  by  a  certain  number  of  characters, 
the  resulting  strings  will  be  of  different  lengths  (see  below).  The  genetic  algorithm 
we  employ  must  be  capable  of  handling  populations  where  the  string  length  is  not 
constant.  One  such  algorithm  is  the  variable  length  genetic  algorithm  (VLGA). 


4.3.1  Variable  Length  Genetic  Algorithm  (VLGA) 

The  variable  length  genetic  algorithm  was  designed  by  Bandyopadhyay,  Murthy,  and 
Pal  [37]  for  optimizing  the  number  of  hyperplanes  needed  to  classify  patterns  in  a 
multidimensional  feature  space.  It  is  able  to  consider  solutions  with  varying  numbers 
of  planes  by  using  strings  of  varying  lengths.  For  example,  a  set  of  h  hyperplanes, 
i  —  ^  ^  /^max  known,  could  be  represented  by  the  string 

-S’ =  Ifi  €.  a  Vi  =  1,...,A*L 

where  A  is  the  set  of  possible  character  values  and  L  characters  are  used  to  represent 
each  plane.  For  example,  if  the  characters  of  the  strings  are  bits,  then  A  =  {0,1}. 
Hence  the  string  length  is  not  fixed  but  variable  -  for  a  given  /i  €  ^  =  {1, . . . ,  /i^ax}) 
the  string  length  \s  h*  L.  The  algorithm  includes  the  basic  stages  -  selection,  crossover, 
and  mutation  -  as  well  as  elitism.  The  elitism  stage  ensures  that  the  algorithm  will 
converge  to  an  optimal  solution,  theoretically  (see  Section  4.5.3). 

Modified  versions  of  crossover  and  mutation  were  designed  to  handle  strings  of  varying 
lengths.  In  the  following  we  assume  that  the  characters  of  the  strings  are  bits. 


•  crossover 


Two  strings  are  chosen  at  random  for  single  point  crossover.  If  two  strings  of 
different  lengths  are  chosen,  then  the  shorter  string  is  padded  with  #s  (a  place¬ 
holder)  so  that  the  lengths  of  the  two  strings  are  equal.  If  crossover  is  performed, 
the  hyperplanes  in  each  resultant  string  are  either  complete  (all  Os  and  Is  or  all 
^s)  or  incomplete.  Let  h  be  a  hyperplane  and  define 

_  number  of  in  representation  of  h 
Pcomp  number  of  bits  in  representation  of  h 

For  each  incomplete  hyperplane,  with  probability  pcomp  all  are  converted  to 
Os  or  Is  (for  each  bit  the  choice  of  0  or  1  is  made  at  random);  else  all  bits  are  set 
to  ^s.  Once  all  of  the  hyperplanes  are  complete,  those  hyperplanes  consisting 
only  of  are  shifted  to  the  end  of  the  string. 

As  an  example,  suppose  /i,nax  =  3,  L  =  4,  and  the  pair  of  strings  chosen  for 
crossover  is 

0  10  111  and  1  0  0  0  1  0  1  0  1 

The  first  string  is  padded  with  #s  so  that  both  strings  have  length  9,  yield¬ 
ing 

010111^##  and  100010101 

If  the  position  chosen  for  crossover  is  between  the  seventh  and  eighth  bits,  then 
after  crossover  the  resulting  strings  are 

0  10111#01  and  1000101## 

Each  string  contains  one  incomplete  hyperplane;  with  probability  Pcomp  =  1/3 
the  bits  in  the  incomplete  hyperplane  in  the  first  string  are  converted  to  Os  or 
Is,  and  with  probability  Pcomp  =  2/3  the  bits  in  the  incomplete  hyperplane  in 
the  second  string  are  converted  to  Os  or  Is.  If  we  assume  that  the  bits  in  the 
incomplete  hyperplane  in  the  first  string  were  all  set  to  #s  while  the  bits  in  the 
incomplete  hyperplane  in  the  second  string  were  converted  to  Os  or  Is,  then  the 
resulting  strings  may  look  like 

010111###  and  100010100 


•  mutation 

Mutation  can  increase  or  decrease  the  string  length.  All  strings  are  initially 
padded  to  be  of  maximum  length  (/imax  *  L).  Conventional  mutation  is  applied 
to  each  defined  bit  (i.e.,  0  or  1)  with  probability  Pmi',  if  a  bit  is  not  mutated 
then  it  is  set  to  #  with  probability  Each  undefined  bit  (#)  is  mutated 
to  a  defined  value  with  probability  Pma-  Incomplete  hyperplanes  which  result 


are  handled  as  in  crossover  (see  above).  If  a  string  of  all  ^s  is  generated,  it  is 
assigned  a  minimum  fitness  value  and  subsequently  removed  by  selection. 

The  mutation  probability  may  vary  over  iterations,  initially  taking  a  high  value  (close 
to  0.5),  then  decreasing  to  a  prespecified  minimum  («  1/M,  where  M  represents 
the  population  size),  then  increasing  again  in  the  later  stages  of  the  algorithm  (see 
[38]).  When  the  algorithm  has  little  knowledge  of  the  search  space,  a  high  mutation 
probability  encourages  it  to  explore  its  domain.  As  the  number  of  iterations  increases 
the  algorithm  will  move  towards  a  solution  so  the  mutation  probability  is  decreased, 
allowing  for  a  search  in  the  vicinity  of  that  solution.  However,  this  solution  may 
not  represent  the  global  optimum,  so  the  mutation  probability  is  again  increased  to 
expand  the  search.  In  this  way  premature  convergence  to  a  suboptimal  solution  may 
be  avoided. 

The  fitness  criterion  was  defined  so  that  the  set  of  hyperplanes  with  the  maximum 
fitness  value  would  (1)  have  the  minimum  number  of  misclassifications  of  any  set  of 
h  hyperplanes,  h  £  'H,  and  (2)  have  the  fewest  number  of  hyperplanes  of  all  sets  of 
hyperplanes  satisfying  (1).  Hence  the  fitness  function  could  be  represented  a.s 

fit{S)  =  {N  -  misss)  + 

'^max 

where  N  is  the  number  of  data  points,  niisss  is  the  number  of  points  misclassified  by 
the  set  of  hyperplanes  represented  by  S,  hs  is  the  number  of  hyperplanes  represented 
by  5,  and  fimax  is  the  maximum  possible  number  of  hyperplanes.  Hence  maximizing 
the  fitness  function  will  yield  a  set  of  hyperplanes  which  is  parsimonious  yet  minimizes 
the  number  of  misclassifications. 

Instead  of  representing  sets  of  hyperplanes,  our  strings  will  represent  variable  knot 
splines.  For  a  spline  with  n  +  m  knots,  since  the  first  m  knots  are  placed  at  a  and  the 
last  m  knots  are  placed  at  6,  it  is  necessary  to  encode  only  the  remaining  n  —  m(=  k) 
interior  knots.  If  characters  are  used  to  represent  each  knot  location  ti  then  a  set  of 
k  interior  knots  may  be  represented  by  the  string 

5  =  (7i,72,...,7fc.,J;  7i  €  {0,1}  Vi  =  1,...,A;*4 

Given  S  the  corresponding  least  squares  spline  of  order  m  can  be  computed  using  exist¬ 
ing  algorithms  (see  Section  5).  The  least  squares  solution  is  unique  if  and  only  if  that 
matrix  of  B-spline  coefficients  has  full  rank,  i.e.,  if  and  only  if  ||/||2  =  (l^,(/(-'i-’t)^)*'^^  is 
a  norm  on  Sm,t-  One  way  to  ensure  that  ||  •  II2  is  a  norm  is  provided  by  the  Sclioenberg- 
Whitney  Theorem: 

Theorem  4.1  [Schoenberg  —  Whitney  \39\  ]  The  matrix  B(t)  is  invertible  if  and  only 
if  there  exists  a  subset  of  {x,}  such  that 

i.e.,  if  and  only  if  f,-  <  x,-,  <  U+m  V  i. 

A  ^ 
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We  enforce  the  above  condition,  which  is  equivalent  to  setting  the  minimum  span  of 
each  basis  function  to  one  observation. 

The  proposed  algorithm  does  include  an  elitist  stage,  as  was  done  in  the  VLGA,  to 
ensure  algorithm  convergence  to  an  optimal  solution.  We  now  discuss  the  choice  of 
fitness  function. 


4.4  Fitness  function 


The  VLGA  fitness  function  is  based  on  two  criteria  -  minimization  of  the  number  of 
misclassifications  and  the  number  of  hyperplanes.  Similarly,  we  are  interested  satisfying 
two  criteria  -  minimizing  the  least  squares  error  of  the  fit  and  the  required  number  of 
knots  (i.e.,  the  required  degree  of  fit).  However,  since  our  objective  is  LS  modeling 
as  opposed  to  classification,  we  shall  use  a  fitness  function  based  on  model  selection 
criteria  from  the  statistical  literature.  A  popular  class  of  criteria  for  determining  the 
appropriate  model  g  are  statistics  based  on  generalized  cross-validation  [20].  These 
measures  of  goodness-of-fit  are  based  on  the  idea,  borrowed  from  parametric  linear 
regression,  of  minimizing  the  residual  sum  of  squares  adjusted  for  the  amount  of  fitting 
being  done  by  the  model,  or,  in  other  words,  for  the  increased  variance  associated  with 
increased  model  complexity.  The  amount  of  fit  is  represented  by  the  number  of  ‘degrees 
of  freedom’  used  by  the  model.  In  spline  fitting,  the  ‘degrees  of  freedom’  of  a  spline  g 
is  defined  as  some  function  of  the  smoothing  matrix  S  where 

^(y)  =  Sy 


Some  common  definitions  [40]  are: 

1.  df  =  tr(SS‘) 

This  definition  is  motivated  by  the  equation  for  linear  models 

var(y;)  =  pcr^ 


where  p  =  df. 

2.  df  =  tr(2S  -  S'S) 

Note  that 


EiBSS)  =  [N-  tr(2S  -  S‘S)]cr2  -h  /'(I  -  S)‘(I  -  S)/ 

If  we  are  simply  smoothing  noise,  so  that  /  =  0,  then  this  definition  of  df  is 
the  expected  drop  in  RSS  simply  due  to  overfit.  In  the  case  of  comparing  two 
different  fits  this  definition  is  particularly  convenient  since 

£;(RSSi  -  RSS2)  =  [tr(2Si  -  SjSi)  -  tr(2S2  -  S^S2)]<7^ 
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3.  df  =  tr(S) 


This  follows  the  definition  of  Mallows’  Cp  where  the  factor  2p<T^  is  added  to  the 
RSS  to  make  is  unbiased  for  the  predicted  MSE.  tr(S)  is  popular  in  the  spline 
literature  because,  under  the  appropriate  Bayesian  assumptions,  the  posterior 
covariance  of  y  is  Scr^. 


Definition  3  is  a  reasonable  choice  since 


1.  For  symmetric  shrinking  smoothers  with  nonnegative  eigenvalues,  we  have 

SS'  <  tr(S)  <  tr(2S  -  S'S) 

(actually,  with  our  model  class  all  three  values  are  the  same),  and 

2.  tr(S)  =  where  A  is  the  ith  eigenvalue  of  S,  so  this  definition  has  the 

advantage  of  being  easy  to  compute. 


This  choice  yields  the  GCV  criteria  as  originally  stated  by  Craven  and  Wahba  [20], 


GCV(g)  = 


^RSSp 

{1  -  tr(S)/A^}2 


where  RSSg  is  the  residual  sum  of  squares  from  the  fit  of  the  model  g  to  the  data 
and  S  is  the  smoothing  matrix  corresponding  to  g.  With  the  given  class  of  models  the 
trace  of  S  is  simply  the  number  of  basis  functions  being  fit.  Hence  tr(S)  =  n. 


A  recent  advance  in  additive  modeling,  where  the  focus  is  on  fitting  regression  splines, 
is  to  adjust  the  GCV  criterion  by  inflating  the  degrees  of  freedom  to  account  for  the 
adaptive  nature  of  the  search  for  basis  functions.  This  leads  to  a  statistic  of  the  form 


GCV(gn)  =  7 - ^ - 

(1  -  (2  +  (n  -  2)d)/Ny 

as  used  in  MARS  modeling  [25]  with  d  =  3.  Here  we  denote  as  to  explicitly 
show  the  number  of  basis  functions  of  which  g  is  composed.  Luo  and  Wahba  [29]  have 
suggested  the  use  of  d  =  1.2  for  cubic  spline  basis  functions.  Although  the  procedure  is 
not  stepwise,  genetic  algorithms  also  employ  an  adaptive  procedure  for  the  selection  of 
basis  functions.  For  this  reason  our  function  <^||y-5'(x)||  (see  Section  4)  will  be  of  the 
above  form.  However,  the  ‘best’  model  will  minimize^  not  maximize,  a  criterion  of  this 
foim.  This  ca,n  be  rectified  by  defining  c  to  be  an  appropriately  chosen,  large  constant 
c  and  maxirnizing  the  function  c  —  GCV (gn)  over  the  given  model  space.  Hence  our 
fitness  function  may  be  expressed  as 


FF(^„)  =  c- 


{l-{2  +  {n-2)dm)/Ny 


(4.4) 


where  gn  is  of  order  m.  For  basis  functions  of  order  m  =  2  of  77i  =  4  (linear  or  cubic), 
the  values  for  d^  suggested  above  will  be  employed.  Otherwise,  an  appropriate  value 
for  dm  can  be  found  by  cross-validation  or  simulation. 
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4.5  Approximation  and  convergence 


To  justify  the  method  outlined  above  we  briefly  discuss  some  approximation  properties 
of  spline  functions  and  the  convergence  of  the  proposed  genetic  algorithm  to  the  global 
optimum. 

Recall  that  the  observed  functional  relationship  between  x  and  y  is  assumed  to  have 
the  form  y,-  =  /(a:,)  +  e,  for  some  continuous  function  /  where  Cj  are  i.i.d.  with  mean 
zero.  If  our  goal  is  to  approximate  /  then  our  model  class  Sm,t  should  asymptoti¬ 
cally  contain  elements  whose  distance  from  /  is  arbitrarily  small.  We  shall  verify  this 
by  stating  the  following  theorems  which  can  be  found,  along  with  the  corresponding 
proofs,  in  de  Boor  [5]. 


4.5.1  Distance  from  splines 

We  first  consider  the  distance  of  a  function  h  from  where  h  satisfies  various 

conditions.  We  defer  the  discussion  of  least  squares  approximation  to  the  next  section. 

Define  l|h||  =  maXa<i:<6  \h{x)\. 

Theorem  4.2  Let  g  G  3'iid  g  G  C[a,6].  Let  |t|  =  max,-  Af,-  where 
Ati  =  —  ti.  Then 

dist{h,S^,t)  =  min{||/i  -  g\\  :  g  6  <  c,n  •  |tl) 

where  u>{h]  |t|)  =  max{|/i(r)  —  h{s)\  :  r,  s  €  [a,  6],  —  5|  <  lt|}  and  Cm  is  a  constant 
whose  value  depends  only  on  m. 

In  the  above  theorem,  Cm  may  be  taken  as  [(m  +  1)/2J. 

Hence  for  any  fixed  order  m,  we  can  approximate  h  to  any  degree  of  accuracy  if  we  are 
willing  to  use  an  arbitrary  number  of  knots.  In  the  case  that  h  is  smooth,  the  above 
bound  can  be  improved  considerably. 

Theorem  4.3  Let  t  =  {L}"=r  satisfy 

•  •  •  =  =  Cl  <;  tfn+l  ^  <  6  =  ^n+l  =  •  •  •  =  tn+m- 

Then  for  j  =  0, . . . ,  m  -  1  there  exists  c^j  such  that  for  any  h  G  b], 

d\st{h,S^,t)<CrnML0{h^^^;\t\). 

If  j  =  m  —  Ijthen  |t|)  <  and  the  above  imply 

dist(/i,5,„,t)  < 
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Here  Cmj  may  be  defined  iteratively  as  Cm,j  =  =  ni=o 

In  comparison  to  the  bound  of  Theorem  4.2,  the  distance  of  a  smooth  function  from 
Sm,t  goes  to  zero  at  a  rate  at  least  as  fast  as  the  jth  power  of  the  mesh  size  lt|. 

In  practice,  the  number  of  interior  knots  is  bounded,  i.e.,  k  <  for  some  positive 

integer  fcmax-  Hence  we  must  examine  how  well  we  can  approximate  /  with  a  fixed 
number  of  knots  whose  placement  is  allowed  to  vary. 

Let  Sm,n  denote  all  splines  of  order  m  with  some  knot  sequence  satisfying 
f  1  =  •  •  •  =  fm  =  a,  tn+l  =  •  •  •  =  tn+m  =  b,  Tl  fixed. 


Theorem  4.4  Assume  that  the  function  h  6  C[a,6]  is  m  times  continuously  differentiable 
at  all  but  finitely  many  points  and 


oo 


then 


dist(/l,<Sm,n)  <  CmTl 


m 


i.e.,  the  order  of  approximation  is  n 

4 


This  bound  comes  from  using  Theorem  4.3  and  noticing  that  if  is  a  breakpoint 
sequence  chosen  so  that 

r^‘  \D"^h{x)\^/^dx  =  t  \D”^h{x)\^l^dx,  i  =  1, . . . ,  m 

Jri  /c  “h  1  JcL 

Since  Sm,n  C  Sm,t  ioT  k  =  n  —  m  <  kmax,  the  above  order  of  approximation  is  a  lower 
bound  on  the  order  of  approximation  attainable  by  functions  from  Sm,t‘ 

From  the  above  theorems  one  can  conclude  that  Sm,t  is  a  reasonable  approximating 
space  for  the  given  problem. 


diSt{h,Vm,rf]<^[a,b])  <  C,n{k  +  I)’"  ^ 
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4.5.2  Least  squares 


Least  squares  approximation  is  suitable  when  the  objective  is  to  recover  information 
a  ou  a  unctional  relationship  from  a  set  of  noisy  observations.  From  the  above 
theorems  we  know  that  by  allowing  for  many  knots,  our  model  class  will  contain  a 
lunction  whose  distance  from  /  is  arbitrarily  small.  Hence,  as  -)■  oo,  fitting  by  least 
squares  will  theoretically  yield  a  model  whose  distance  from  /  is  negligable.  This  is 

one  consideration  which  motivated  our  use  of  least  squares  in  combination  with  splines 
lor  model  fitting.  ^ 

There  are  some  practical  considerations,  however.  First,  one  can  achieve  a  least  squares 
model  simply  by  allowing  for  N  knots  and  interpolating  the  data.  Given  a  noisy  dataset 
this  IS  something  to  avoid;  thus  not  only  should  the  number  of  knots  be  less  than  the 
number  of  data  points  but  the  model  fitting  criterion  should  penalize  models  with  many 
terms  to  avoid  overfit  .  For  a  fixed  number  of  knots  we  are  finding  the  LS  solution; 
the  penalty  provides  a  way  to  determine  the  ‘proper’  number  of  knots.  Hence  such 
an  adjusted  sum  of  squares  is  a  reasonable  choice  for  our  fitness  function.  Another 
consideration  m  criterion  selection  is  that,  from  a  statistical  standpoint,  we  desire  that 
le  final  model  be  parsimonious  -  for  the  same  amount  of  error,  we  prefer  a  model  with 
ew  terms  Thus  not  only  should  we  choose  a  model  fitting  criterion  which  penalizes 
models  with  many  terms,  but  we  should  also  select  a  model  class  containing  those  spline 
functions  which  maximize  the  amount  of  information  contained  in  each  knot.  Variable 
class  with  an  optimization  of  knot  placement,  represent  such  a 

The  amount  of  irnprovement  in  terms  of  model  fit  that  one  can  achieve  by  allowing 
the  knots  to  vary  is,  of  course,  dependent  upon  the  given  dataset.  On  ‘nice’  datasets 
•fl!  variability,  one  would  expect  less  improvement  over  a  spline  model 

with  fixed  knots  or  an  optimal  knot  distribution  (see  Theorem  4.4)  as  compared  to 
e  improvement  one  would  expect  on  datasets  with  local  properties  where  the  ability 
of  adaptive  knots  to  work  as  local  bandwidth  smoothers  can  be  more  advantageous. 

owever,  given  any  dataset,  optimizing  knot  placement  is  attractive  when  the  number 
of  knots  in  the  final  model  is  of  importance.  Since  our  objective  is  a  parsimonious 
oclel,  optimizing  knot  placement  is  of  key  importance. 


4.5.3  Convergence  to  optimum 

For  fixed  knot  splines,  the  model  space  is  linear  and  finite  dimensional  on  [a,  61.  The 
bchoenberg-Whitney  condition  (see  Theorem  4.1)  ensures  that  the  finite  ‘norm’ 

II5II2  =  (E(^/(®.))'*)'^' 

t 

IS  a  norm  and  hence  a  best  approximation  s*  from  this  space  to  /  with  respect  to  llolU 
exists.  However,  the  space  of  variable  knot  splines  (fixed  dimension)  is  nonlinear  and 
ence  what  constitutes  a  best  approximation  is  difficult  to  characterize.  It  also  may 

fhl  ‘hf  ‘best’  approximation  or  fit  as  a  function  which  maximizes 

e  given  fitness  function  (Eqn.  4.4).  For  a  fixed  number  of  knots  this  is  equivalent  to 
finding  the  least  squares  solution  from  the  class  of  splines  with  simple  ‘free’  knots. 
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As  stated  in  Section  3,  Bhandari,  Murthy,  and  Pal  [35]  have  proven  theoretically  that 
an  elitist  genetic  algorithm  (fixed  length  strings)  will  converge  to  an  optimal  string  as 
the  number  of  iterations  goes  to  infinity.  The  two  characteristics  that  are  necessary 
and  sufficient  for  algorithm  convergence  are 

•  The  optimal  string  from  the  present  population  has  a  fitness  value  no  less  than 
the  fitness  values  of  the  optimal  strings  from  the  previous  populations. 

•  Each  string  has  a  positive  probability  of  going  to  an  optimal  string  within  any 
given  iteration. 

By  preserving  these  properties,  the  variable  length  genetic  algorithm  has  also  been 
shown  to  converge  to  an  optimal  string  [37].  This  convergence  is  independent  of  the 
choice  of  values  for  the  algorithm  parameters  (population  size  A/,  crossover  probability 
Po  mutation  probabilities)  although  these  parameter  values  do  influence  the  rate  of 
convergence.  There  is  no  theory  to  indicate  the  number  of  iterations  necessary  for 
convergence.  Two  popular  heuristic  stopping  rules  are 


•  Execute  the  process  for  a  fixed  number  of  iterations  and  report  the  best  string 
found  as  the  solution. 

•  Execute  the  process  until  the  fitness  value  does  not  show  adequate  improvement 
over  a  fixed  number  of  iterations,  and  report  the  best  string  found  as  the  solution. 

We  will  employ  the  first  stopping  rule  in  our  GA  based  method. 


4.6  Variability  of  results 

For  a  given  dataset  our  method  will  yield  a  variable  knot  spline  model  g  as  an  estimate 
of  the  function  /.  It  is  of  interest  to  estimate  the  variability  of  our  I'esult,  i.e.,  to 
calculate  a  statistic  which  reflects  the  error  of  the  model  given  the  choice  of  modeling 
technique.  Note  that  because  of  the  stochastic  nature  of  the  genetic  algorithm,  different 
applications  of  the  above  method  to  the  same  dataset  will  yield  different  results.  We 
also  have  the  additional  variability  stemming  from  the  adaptive  determination  of  the 
appropriate  set  of  basis  functions.  Our  estimate  of  model  accuracy  should  reflect  both 
of  these  sources  of  error. 

For  a  fixed  knot  sequence,  pointwise  standard  errors  (or  global  confidence  sets)  could 
be  calculated  theoretically.  If  ==  q:B  represents  the  fitted  model,  where  B  is  the  ma¬ 
trix  of  B-spline  basis  functions  (the  dependence  on  the  chosen  knot  sequence  hcis  been 
supressed  in  the  notation)),  dc  is  the  vector  of  basis  coefficients,  and  represents  the 
variance  of  j/t,  z  =  1, . . . ,  A^,  then  pointwise  standard  errors  can  be  derived  as  follows 
[12]: 


If 


S  =  cov(a:)  =  (B‘B) 


then 


cov{g)  =  cov(dB)  =  Bcov(d)B‘  =  BSB‘ 


(4.6) 


<7  can  be  estimated  from  RSS]  the  diagonal  of  cov{g)  contains  estimates  of  the  point- 

wise  variances  so  the  diagonal  elements  of  (BEB*)*/^  are  estimates  of  the  pointwise 
standard  errors. 

One^can  incorporate  the  variability  due  to  the  adaptive  model  search  into  the  estimate 
of  or  by  adjusting  the  degrees  of  freedom,  as  described  in  Section  4.4.  However,  the 
erioi  due  to  the  stoachastic  nature  of  the  GA  has  yet  to  be  considered.  The  nascent 
state  of  GA  theory  suggests  that  the  most  feasible  way  to  attain  a  statistic  which  also 
captures  this  source  of  error  is  via  simulation,  e.g.,  by  bootstrapping  or  other  Monte 
Carlo  methods. 

The  bootstrap  was  introduced  by  Efron  [41]  as  a  computational  method  for  determin¬ 
ing  the  accuracy  of  a  parameter  estimate.  It  is  particularly  useful  in  situations  where 
assessing  accuracy  is  beyond  the  existing  theory  of  the  estimation  method  or  the  prob¬ 
lem  is  too  complicated  for  a  traditional  statistical  analysis.  The  bootstrap  is  one  way 
to  use  computational  results  in  lieu  of  theoretical  analysis  to  effectively  provide  a  for¬ 
mula  for  the  standard  error  as  a  function  of  the  sampling  distribution  of  the  data.  The 
basic  outline  of  the  bootstrap  technique  is  given  below;  a  similar  introduction  can  be 
found  in  Efron  and  Tibshirani  [42]. 

Let  y  =  (t/i, . . , , y;v)  be  observed  values  of  the  random  variables  a;i, . . . ,  ~  i.i.d  F 

foi  some  probability  distribution  F .  The  basic  idea  behind  bootstrap  standard  errors 
IS  to  derive  cr  from  the  empirical  probability  distribution  of  F.  In  other  words,  if 
a{F)  =  [varF,iv(^(y))]i/^  where  0(y)  represents  our  parameter  estimate,  then  a  =  <t{F) 

where  F  places  probability  1/N  on  each  observation,  a  is  evaluated  by  a  Monte  Carlo 
algorithm: 


1.  Draw  a  large  number  J  of  bootstrap  samples  where  each  is  an  indepen¬ 

dent  random  sample  drawn  with  replacement  from  the  original  sample. 

2.  For  each  sample  yj,  j  =  1, . . . ,  J,  calculate  0)  -  9{y]). 

3.  Calculate  the  sample  standard  deviation  of  given  by 

X  y.Qk 
where  0^  =  — ^ 


IZi{e]-90f 
j-i 


a  = 


J 


Essentially  we  are  evaluating  a  standard  deviation  by  Monte  Carlo  sampling.  One 
can  of  course  replace  standard  error  with  some  other  measure  of  error,  such  as  bias  or 
prediction  error,  and  apply  the  same  method.  In  this  CcLse  only  step  3  in  the  above 
must  be  modified. 
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The  number  of  bootstrap  samples  necessary  for  calculating  a  standard  error  is  an  open 
question.  In  general,  a  rough  minimum  sample  size  is  50  <  J  <  200. 

The  above  discussion  suggests  the  following  method  for  estimating  pointwise  standard 
errors  for  our  model: 

Let  g  be  the  model  fit  to  the  original  dataset  (i.e.,  g  is  our  estimate  6). 


1.  Fix  the  algorithm  parameters  (e.g.,  fcmax>  Pc  mutation  probabilities,  maximum 
number  of  iterations)  at  those  values  used  in  determining  g.  Choose  a  value  for 
J,  50  <  J  <  200. 

2.  Draw  J  bootstrap  samples  {yj}/=i  where  each  is  an  independent  random  sample 
drawn  with  replacement  from  the  original  sample  y  =  (j/i, . . . ,  j/yv)- 

3.  On  each  sample  y*,  j  =  1,...,J,  apply  the  GA  method  to  get  an  estimate 
g)  =  fl''’(yj)- 

4.  For  each  j  =  1,...,J,  calculate  the  estimate  <rj  of  the  pointwise  standard 
errors,  o'*  is  defined  as  the  vector  whose  elements  are  the  square  roots  of  the 
diagonal  elements  of  cov(gj)  as  defined  in  Eqn.  4.6  where  is  estimated  as 

_  RSS 

""  ~  (l-(2  +  (n-2)dm)/Ny 


(see  Section  4.4).  Here  n  is  the  number  of  basis  functions  in  Bj  where  gj  =  otjBj 
and  dm  is  as  defined  in  Section  4.4. 

5.  Calculate  the  sample  estimate  of  the  pointwise  standard  errors,  d***,  given  by 


The  above  method,  albeit  heuristic,  uses  the  bootstrap  technique  combined  with  an 
adjusted  pointwise  standard  error  calculation  to  attempt  to  capture  the  various  sources 
of  model  variability. 

Note  that  with  our  method  we  must  run  a  GA  to  get  each  g^  so  the  computational 
expense  of  such  accuracy  estimates  is  considerable.  Whether  this  expense  is  prohibitive 
can  only  be  determined  through  experimentation. 
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5  Proposed  Experiments 


5.1  Methods  and  implementation 

5.1.1  Genetic  algorithm 


We  will  apply  a  VLGA  to  each  dataset  with  chosen  appropriately  (for  some 
datasets  appropriate  values  for  Aimax  can  be  estimated  from  previous  analyses  where 
alternative  model  fitting  methods  were  used).  We  expect  values  in  the  range  of 
5  to  10.  Binary  coding  will  be  used  although  an  alternative  coding  scheme  could  be 
used.  Each  knot  location  will  be  represented  by  Ik  bits.  If  a  given  knot  is  represented 
by  the  bit  sequence  71,72, . . .  ,7,^,,  then  the  value  of  U  is  given  by  the  formula 


U  —  ^(1)  + 


(x(/v)  -  .r-(i))  ^ 

24  _i  2_.7. 

t  =  l 


Hence  4  determines  the  number  of  possible  values  for  each  knot  and  the  minimum 
distance  between  distinct  knots.  Clearly  the  choice  of  depends  on  the  range  of  a:  and 
the  distance  between  consecutive  a;  values,  as  well  as  the  choice  of  k^^^.  A  reasonable 
range  for  4  is  6  ^  /i-  <  15.  The  crossover  probability  will  be  set  at  0.8  and  the 
mutation  probability  will  vary  over  iterations  as  described  in  Section  4.3.1.  The 
other  mutation  probabilities  associated  with  the  VLGA  will  be  set  to  reflect  random 

maximum  number  of  iterations,  MaxNit,  will  be  taken  as  100 
and  10000,  respectively.  The  choice  of  order  will  be  made  from  the  set  {3,4,5}. 

In  cases  where  a  comparable  method  (see  below)  has  previously  been  applied  to  one  of 
_  e  experimental  datasets  described  below,  the  choice  of  certain  parameter  values  will 
be  made  to  ease  the  comparison  of  results. 

For  a  fixed  knot  sequence,  the  least  squares  B-spline  coefficients  will  be  calculated 
using  de  Boor’s  algorithm  L2MAIN  [5]. 


5,1.2  Methods  for  comparison 

The  performance  of  our  method  will  be  compared  to  the  performance  of  three  existing 
methods  for  fitting  variable  knot  splines.  Each  of  these  methods  incorporates  a  method 
lor  selecting  an  optimal’  knot  sequence. 


•  Schwetlick  and  Schutze  (1995) 

A  non-linear  optimization  algorithm  for  least  squares  spline  fitting  with  ‘free’ 
simple  knots.  Instead  of  enforcing  the  Schoenberg-Whitney  condition,  a  regular¬ 
ization  term  of  the  form 

M  ‘  2  /  fixed,  r  €  {0,...,m- 1} 
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for  a  spline  model  of  order  m  is  added  to  the  sum  of  squares  error  and  the 
combination  of  the  two  terms  is  minimized  (hence  the  models  are  smoothing 
splines).  The  number  of  knots  to  include  in  the  model  is  determined  by  a  stepwise 
selection  procedure.  The  objective  is  to  model  the  data  within  an  estimate  of 
its  noise  level  using  as  few  knots  as  possible.  The  required  parameters  (to  be 
selected  by  the  user)  include  an  initial  knot  sequence,  the  order  of  the  spline,  the 
smoothing  parameter  /Li,  and  an  estimate  of  the  noise  level'of  the  data. 

•  Manela  et.  al.  (1993) 

A  genetic  algorithm  is  used  to  fit  B-spline  models  according  to  a  pencilized  least 
squares  criterion  based  on  GCV.  The  smoothing  parameter  is  optimized  while 
the  order  of  the  spline  and  the  number  of  knots  are  optimized  within  the  con¬ 
straints  given  by  Dierckx  [8].  For  a  given  number  of  knots,  the  knot  placement 
is  determined  by  Dierckx’s  algorithm  [8]  which  attempts  an  optimal  knot  distri¬ 
bution.  The  usual  genetic  algorithm  parameters  (e.g,  Pc,  Pm,  MaxNii,  string 
length)  must  be  selected. 

•  HAS:  Luo  and  Wahba  (1997) 

A  cubic  regression  spline  model  is  fit  to  the  given  data  using  reproducing  kernel 
basis  functions.  The  knot  locations  are  restricted  to  a  chosen  subset  of  the  design 
points.  Knots  are  initially  added  to  the  model  by  forward  selection  according  to 
the  GCV  criterion  given  in  Eqn.  4.4.  The  resulting  model  is  regularized  by  a 
ridge  regression  step  in  place  of  backwards  deletion  of  knots,  where  the  ridge  re¬ 
gression  parameter  is  determined  by  GCV.  For  simulated  data,  the  performance 
of  the  final  model  is  evaluated  by  calculating  the  median  MSE  over  replications. 
The  set  of  candidate  knot  locations  and  the  possible  model  sizes  are  determined 
by  the  user. 

The  parameters  for  these  methods  will  be  selected  according  to  previous  applications 

as  well  as  for  the  purpose  of  comparison  with  the  proposed  method. 


5.1.3  Datasets 

The  performance  of  the  above  methods  will  be  compared  on  both  simulated  and  real 
datasets  in  TV.  The  simulated  datasets  are  constructed  by  generating  N  observa¬ 
tions  from  a  given  function  corrupted  with  random  noise,  i.e.,  Cj  ~  i.i.d.  ^(0,cr),  i  = 
1, . . . ,  AT,  for  some  distribution  Q,  a  >  0  given.  One  of  the  three  simulated  datasets  is 
borrowed  from  Luo  and  Wahba  [29], 


•  fi(x)  =  sin(2(4a:  —  2))  -1-  2exp(— ICx^)  x  €  [0, 1] 

with  e,-  ~  i.i.d.  A/"(0,0.4),  and  one  is  acquired  from  Schwetlick  and  Schiitze  [6], 

•  f2{x)  =  10x/(l  +  100a:2)  X  6  [-2,2] 
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with  ei  -  i.iA.  [/m/(-0.05, 0.05).  For  both  of  these  datasets  the  x^s  are  equally 
spaced.  The  remaining  simulated  dataset  will  be  generated  from 

•  /3(z)  —  cos(37r2:)/(0.5  +  4^^) 

where  Zi  ~  Uni f  (0^1)  and  Ci  ~  iA.cl.  A/"(0,0.2). 

The  real  dataset  is  the  Imports  data  set  from  Hand  et.  al.  [43]. 

We  shall  run  several  replications  of  each  method  on  the  third  simulated  dataset  (with 
a  fixed  choice  of  and  compare  the  median  MSE  performance  of  the  above 

methods  as  was  done  in  [29]. 

We  summarize  the  above  in  the  following  table: 


Set  # 

/ 

N 

1 

fi{x)  =  sin(2(4a:  -  2))  +  2exp(-16x'‘) 

256 

AA(0,0.4) 

2 

filx)  =  10x/(l  +  100x2) 

90 

[/m7(-0.05,0.05) 

3 

fsiz)  —  cos(37r2)/(0.5  +  42^) 

100 

Ar(0,0.2) 

4 

Imports 

30 

Table  5.1.3:  Experimental  Datasets 


For  the  function  fi  we  shall  generate  100  bootstrap  samples  and  calculate  point  wise 
standard  errors  for  the  fitted  GA  model  via  the  method  described  in  Section  4.6.  These 
will  be  used  to  plot  ±2  standard-error  bands  with  our  estimate. 


5.2  Expected  results 

We  expect  our  method  to  yield  comparatively  simple  yet  accurate  models  for  several 
reasons.  First,  methods  where  the  knot  distribution  is  optimized  are  not  placing  the 
knots  with  the  objective  of  minimizing  the  sum  of  squares  error.  By  doing  so,  the  GA 
method  should  yield  more  accurate  models  with  respect  to  this  error.  Second,  stepwise 
knot  selection  methods  are  known  to  be  suboptimal  -  in  the  above  method  this  type 
of  search  has  been  replaced  by  a  search  which  theoretically  yields  the  optimal  knot 
sequence.  Finally,  the  GA  method  is  quite  flexibile  in  terms  of  model  selection  since  it 
does  not  restrict  the  possible  knot  locations  to  a  subset  of  the  observations.  We  would 
also  expect  to  obtain  parsimonious  models  since  by  optimizing  knot  placement  we  are 
maximizing  the  amount  of  information  which  can  be  represented  by  each  knot. 

It  should  be  noted  that,  unlike  Manela  et,  al.  [7],  we  are  not  optimizing  the  order  of  the 
spline.  The  prevailing  opinion  from  the  spline  literature  is  that  knot  placement  is  more 
critical  than  the  choice  of  order  [1];  for  this  reason  we  expect  the  benefits  of  optimizing 
knot  placement  to  outweigh  the  disadvantage  of  possibly  having  a  suboptimal  choice 
of  order. 
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These  reasons  aside,  the  amount  of  improvement  of  the  GA  model  will  depend  on  the 
properties  of  the  given  dataset.  As  mentioned  in  Section  4.5,  the  optimizing  of  knot 
placement  will  be  most  beneficial  when  the  dataset  shows  substantial  local  proper¬ 
ties,  e.g.,  high  local  variability,  peaks,  and  regions  of  high  curvature.  If  the  observed 
data  does  not  show  these  types  of  characteristics,  then  the  additional  computational 
resources  required  to  optimize  the  knot  placement  may  be  viewed  as  an  unnecessary 
expense.  It  is  also  possible  that  the  applicability  of  the  method  will  be  restricted 
because  of  high  computational  cost.  In  this  case  the  decision  to  implement  the  GA 
method  on  a  given  dataset  should  only  be  taken  after  a  careful  consideration  of  the 
required  resources  and  the  potential  modeling  benefits. 


6  Conclusions  and  future  research 


A  method  has  been  proposed  for  fitting  variable  knot  splines  where  the  number  of 
knots  as  well  as  their  placement  are  optimized  with  respect  to  an  adjusted  sum  of 
squares  fitting  criterion.  The  optimization  of  knot  placement  should  lead  to  improved 
models,  especially  for  datasets  with  local  properties  such  a  peaks  and  areas  of  high 
variability.  The  resulting  models  should  also  be  parsimonious  in  the  sense  that  each 
knot  may  be  placed  to  maximize  the  amount  of  information  it  contains  regarding  the 
relationship  between  the  observed  variables.  The  expense  of  these  models  in  terms 
of  computational  resources  may,  however,  outweigh  their  improved  performance.  This 
can  only  be  determined  by  experiments  such  as  those  outlined  above. 

If  the  genetic  algorithm  method  does  prove  to  be  substantially  beneficial,  two  directions 
for  future  research  are  under  consideration:  extension  to  high  dimensions  and  other 
model  selection  criteria. 


6.0.1  Higher  dimensions 

Currently  the  most  feasible  approach  for  extension  to  higher  dimensional  data  appears 
to  be  through  additive  modeling,  as  was  done  by  Rogers  [44]  with  MARS  models  [25]. 
We  would  like  to  examine  whether  replacing  stepwise  techniques  with  genetic  algorithm 
optimization  for  knot  sequence  selection  can  lead  to  improved  fitting  of  other  types  of 
additive  models;  more  specifically,  HAS  models  [29]  and  the  polynomial  tensor  product 
spline  models  of  Stone  et.  al.  [45]. 


6.0.2  Model  selection  criteria 

The  statistical  literature  contains  a  rich  class  of  model  selection  criteria  based  on  differ¬ 
ent  measurements  of  model  error  and  degree  of  model  fit.  These  include  statistics  based 
on  Mallows’  Cp  [21],  AIC  [45],  and  various  GCV  criteria  ([20],  [25],  [29]).  Our  choice  of 
model  selection  criteria,  although  supported  by  statistical  arguments,  is  mainly  heuris¬ 
tic.  Other  data  analysts  may  prefer  to  use  different  criteria;  for  this  reason  we  intend 
to  adapt  our  procedure  for  use  with  other  selection  criteria. 
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We  also  recognize  that  least  squares  methods  have  problems  with  outliers,  so  one  may 
prefer  to  use  a  more  robust  criterion.  Here  the  RSS  can  be  replaced  by  a  function  of 
the  form 


N 


Vi  -  gjxj) 


) 


where  ?  is  a  scale  measure,  e.g.,  a.  For  a  fixed  set  of  basis  functions  {5j}j=i,...,n  this 
leads  to  the  modified  normal  equations 


N 


Ui  -  s{xi) 


)  = 


0  j  =  l,...,n 


Popular  choices  for  rj)  are  the  M-estimate  [47] 

«*)={csign(x)  other^e 


for  some  constant  c  or  Andrews’  [46]  sine  function 


xP{x) 


{a^{l  —  cos{x/a))  if  |a;|  <  ttg 
2a^  otherwise 


where  a  is  a  given  constant.  Lenth  [47]  has  fit  robust  cubic  splines  with  V’  as  given  in 
Eqn.  6.0.2;  the  implementation  of  a  similar  fitting  procedure  combined  with  GA  knot 
selection  is  a  current  focus  of  research  interest. 
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